A pathway analysis-based algorithm for calculating the participation degree of ncRNA in transcriptome

After sequencing, it is common to screen ncRNA according to expression differences. But this may lose a lot of valuable information and there is currently no indicator to characterize the regulatory function and participation degree of ncRNA on transcriptome. Based on existing pathway enrichment methods, we developed a new algorithm to calculating the participation degree of ncRNA in transcriptome (PDNT). Here we analyzed multiple data sets, and differentially expressed genes (DEGs) were used for pathway enrichment analysis. The PDNT algorithm was used to calculate the Contribution value (C value) of each ncRNA based on its target genes and the pathways they participates in. The results showed that compared with ncRNAs screened by log2 fold change (FC) and p-value, those screened by C value regulated more DEGs in IPA canonical pathways, and their target DEGs were more concentrated in the core region of the protein–protein interaction (PPI) network. The ranking of disease critical ncRNAs increased integrally after sorting with C value. Collectively, we found that the PDNT algorithm provides a measure from another view compared with the log2FC and p-value and it may provide more clues to effectively evaluate ncRNA.


Results
The C value of each DE ncRNA is equal to the sum of BP value, CC value, MF value and KEGG value. We calculated the C value of each DE miRNA in skeletal muscle denervation, prostate cancer, Alzheimer's disease and gastric cancer data sets respectively. In addition, we calculated the C value of each lncRNA in skeletal muscle denervation and adipocyte differentiation data sets. The details of these data were aggregated into a table (Table 1). The C values of each DE ncRNA based on biological process (BP), cellular component (CC), molecular function (MF) and KEGG analysis can be obtained, and we named these C values as BP value, CC value, MF value and KEGG value respectively. The total C value of each DE ncRNA was equal to the sum of BP value, CC value, MF value and KEGG value. The DE miRNAs were sorted with the total C value to obtain the 10 DE miRNAs with maximum C value, named as top10 C value miRNAs ( Table 2). The top10 DE miRNAs with maximum absolute Log2 FC (top10 FC miRNAs), and the top10 DE miRNAs with minimum p-value (top10 p-value miRNAs), were obtained by sorting the DE miRNAs according to the absolute Log2 fold FC and p-value respectively (Supplementary Tables 1, 2). Similarly, DE lncRNAs were processed in the same way to obtain top5 C value lncRNAs, top5 FC lncRNAs, top5 p-value lncRNAs for adipocyte differentiation and top10 C value lncRNAs, top10 FC lncRNAs, top10 p-value lncRNAs for skeletal muscle denervation (Table 3, Supplementary  Tables 3-6). C value is superior to log2 FC and p-value in miRNA operation results. In each data set, the most significant enriched IPA canonical pathways were obtained by core analysis (Supplementary Table 7). We took the intersections of DEGs with the predicted target genes of top10 C value miRNAs, top10 FC miRNAs and top10 p-value miRNAs respectively, and then calculated the proportion of these intersections in the above pathways. It was found that the proportion of top10 C value miRNAs target genes was significantly larger than that of top10 FC miRNAs, top10 p-value miRNAs in most pathways (Fig. 1). We built several PPI networks based on DEGs, and calculated the degree of each node. The node with a larger degree had a darker color and was closer to the center. Then we divided these nodes into the core region (top 20% of degree), sub core region (top 20%-50% of degree) and noncore region (bottom 50% of degree) (Fig. 2a,e,i,m). In the PPI network, the predicted target genes of top10 C value miRNAs, top10 FC miRNAs and top10 p-value miRNAs were labeled in red (Fig. 2). It was found that the number of top10 C value miRNAs' target genes in each region were larger than those of top10 FC miRNAs, and top10 p-value miRNAs, and the C value group are more concentrated in core region (Fig. 3) (Table 4). www.nature.com/scientificreports/ Based on extensive literature, we identified 14 skeletal muscle growth regulatory miRNAs, 6 Alzheimer's disease associated miRNAs, 6 prostate cancer associated miRNAs, and 6 gastric cancer associated miRNAs and found that when DE miRNAs were sorted by C value, the sum of the ranks of these miRNAs was significantly smaller than that of the other two indexes, which means that these miRNAs sequences increased integrally (Fig. 4). When sorting by C value versus sorting by absolute Log2 FC/ p-value, most of the disease critical miR-NAs ranked up (Fig. 4) (Supplementary Table 8).
C value is superior to log2 FC and p-value in lncRNA operation results. In the skeletal muscle denervation data set, we calculated the proportion of the predicted target genes of top10 C value lncRNAs, top10 FC lncRNAs, and top10 p-value lncRNAs in the most enriched IPA canonical pathways respectively, and found  www.nature.com/scientificreports/ that the proportion of the genes regulated by top10 C value lncRNAs was larger than that of top10 FC lncRNAs and top10 p-value lncRNAs (Fig. 5a). It was found that the number of top10 C value lncRNAs' target genes in each region were larger than those of top10 FC lncRNAs, and top10 p-value lncRNAs and the C value group are more concentrated in the core region ( Fig. 5b-e) ( Table 5).
Since there are relatively few DE lncRNAs and DE mRNAs in the adipocyte differentiation data set, we take top5 C value lncRNAs, top5 FC lncRNAs, top5 p-value lncRNAs. The proportion of the genes regulated by top5 C value lncRNAs was larger than that of top5 FC lncRNAs and top5 p-value lncRNAs in enriched IPA canonical pathways (Fig. 6a). It was found that the number of top5 C value lncRNAs' target genes in each region were larger than those of top5 FC lncRNAs, and top5 p-value lncRNAs and the C value group are more concentrated in the core region (Fig. 6b-e) ( Table 5). And when DE lncRNAs were sorted by C value, the adipocyte differentiation associated lncRNAs sequences increased integrally than that of the other two indexes (Fig. 6f-g) (Supplementary Table 8). Gastric cancer dataset. The degree of each node was calculated. The larger the degree of the node, the darker the color and the closer the position is to the center. The top 20% nodes are defined as core regions, the top 20%-50% nodes are defined as sub core regions, and the remaining nodes are noncore regions. (b,f,j,n) Distribution of FC group in PPI network. (c,g,k,o) Distribution of p-value group in PPI network. (d,h,l,p) Distribution of C value group in PPI network. Red is the selected node, blue is the unselected. Number of genes in core region, sub core region and noncore region of each group has been tagged. STRING v11.0 was used to generate protein interactions, and the resulting network was visualized using Cytoscape v3.7.2. (FC group: the collection of the top10 FC miRNAs' predictive target mRNAs; p-value group: the collection of the top10 p-value miRNAs' predictive target mRNAs; C value group: the collection of the top10 C value miRNAs' predictive target mRNAs). lyzed, and the proportion of the C value group in the top10 pathways was calculated compared with the other two groups. We found that in miRNA data set, the efficiency of the C value group was improved by 61% compared with the FC group, and by 145% compared with the p-value group. In lncRNA data set, the C value group increased by 39% compared with the FC group, and by 78% compared with the p-value group (Table 6). Then, by analyzing the results of PPI network and calculating the ratio of the C value group in core region compared with the other two groups, we found that the C value group in miRNA data set increased by 10% compared with the FC group and by 18% compared with the p-value group. In lncRNA data set, the C value group increased by 85% compared with the FC group, and by 81% compared with the p-value group. In general, there is little difference between the results of miRNA and lncRNA, and a greater difference occurs between different data sets, which may be related to the quality of data sets (Table 7).

Discussion
After high-throughput sequencing, it is common to screen ncRNA according to expression differences. But this may lose a lot of valuable information and lead to biased results. Considering the strong regulatory function of ncRNA on gene expression, there is currently no indicator to characterize the regulatory function and  To test the superiority of C value, we compared it with absolute Log2 FC and p-value. Log 2 FC reflects the expression change of ncRNAs and p-value reflects how significant the change is. The two indexes of each DE RNA were obtained after the traditional whole transcriptome sequencing, and many follow-up studies have partially referenced Log2 FC and p-values in selecting the target gene 9,10,16 . We analyzed four microRNA data sets and two lncRNA data sets, and compared the C value with Log2 FC and p-value in each data set. First, we performed enrichment analysis on DEGs to obtain the most enriched IPA canonical pathways. We found that top C value ncRNAs targeted more genes in these pathways than FC and p-value groups, which may suggest that top C value ncRNAs have greater regulatory potential for enriched pathways. Further, we constructed a PPI network based on DEGs, partitioned the PPI by degree, and then observed the distribution of the three groups in different partitions. It was found that the number of target genes of top C value ncRNAs in each region was greater than that of the other two groups. At the same time, a larger proportion of target genes in the C value group were concentrated in the central region of the PPI. It suggests that the top C value ncRNA has a broader and more important influence on the PPI network than the other two groups. Finally, based on literature search, we obtained key ncRNAs that regulate various pathological/ physiological processes, and then tested the screening effect of the three indicators on these key ncRNAs in the datasets. It was found that using the C value to rank ncRNAs made the overall ranking of these key ncRNAs higher than the other two indicators. This suggests that ncRNAs screened with C values have a greater potential for regulating pathological/physiological processes.
In order to correct the bias caused by only considering expression differences when screening ncRNA, many analysis methods and databases have been derived, such as GSEA 11 IPA 12 , David 13 , Catmap 14 and GlobalTest 15 . Their analytical methods have different priorities, but the general idea is the same, that is, to perform functional annotation on the RNA profile. But through these methods, we can only observe which genes and pathways are associated with ncRNAs. We do not have a measure to evaluate the participation degree of ncRNA in transcriptome. This lack may result in our inability to assess the priority of two ncRNAs when their target genes are close in number. Or when the two ncRNA regulate similar pathways, we cannot judge their participation degree in the expression regulation of the transcriptome. The algorithm PDNT proposed in this study is based on these pathway analysis methods. We hope to make better use of the pathway enrichment results to evaluate ncRNA and we integrated more valuable information to optimize the screening efficiency of ncRNA. The limitation of this study is that we only calculated based on one pathway enrichment method. In the subsequent study, we www.nature.com/scientificreports/ will compare the differences between the results calculated based on different pathway enrichment methods, to provide more inspiration and help for related research. www.nature.com/scientificreports/   www.nature.com/scientificreports/ Based on the above evidence, the PDNT is an efficient algorithm for calculating the participation degree of ncRNA in transcriptome based on pathway analysis. We found that the PDNT algorithm provides a measure from another view compared with the log2FC and p-value and it may provide more clues to effectively evaluate ncRNA.
LncRNA: The target genes of lncRNAs are predicted by co-expression analysis among samples. The Weighted Gene Correlation Network Analysis (http:// www.r-proje ct. org/) 23 was used to calculate Pearson correlation coefficients. The absolute value of the Pearson correlation coefficient ≥ 0.90, p-value < 0.01 and FDR < 0.01 was saved.
GO and KEGG pathway enrichment analysis. In this study, the screening criteria for DEG were p < 0.05 and absolute Log2 FC ≥ 1.
GO is a database established by Gene Ontology consortium (http:// www. geneo ntolo gy. org), which includes three parts: molecular function, biological process and cell composition. KEGG is based on the Kyoto Encyclopedia of Genes and Genomes (KEGG) database (http:// www. genome. ad. jp/ kegg/), Fisher exact test and × 2 test were used. Enrichment analysis of differentially expressed genes was performed using clusterProfiler R software package 24 , and gene length bias was corrected. The corrected p-value less than 0.05 was considered to be significantly enriched by differentially expressed genes. C value mathematical model and its calculation. The C value of each DE ncRNA is calculated using the PDNT algorithm (Fig. 7): p-value is the p-value of the pathway enriched by DEGs; Proportion refers to the proportion of the intersection between ncRNA target genes and DEGs in each pathway; n represents the number of pathways enriched by DEGs.
Proportion k * (−log10(pValue) Retrieval and statistics of key miRNAs and lncRNAs. We searched PubMed (http:// www. ncbi. nlm. nih. gov/ pubmed) for miRNAs that play important roles in skeletal muscle denervation, Alzheimer's disease, prostate cancer and gastric cancer, respectively. The key words were "skeletal muscle AND microRNA", "Alzheimer's disease AND microRNA", "prostate cancer AND microRNA", and "gastric cancer AND microRNA". Next, we retrieved the lncRNAs that play an important role in skeletal muscle denervation and adipocyte differentiation. Keywords: "skeletal muscle AND lncRNA" and "adipocyte differentiation AND lncRNA". The results were shown in Table 8.
Data Analysis. The analysis platform is R 3.6.1 and the R package is clusterProfiler. The database is org.
Mm.eg.db developed with the R package.